Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
Abstract
Topic models such as latent Dirichlet allocation (LDA) have become a staple of the machine learning modeling toolbox. They have been applied to a wide variety of data sets, contexts, and tasks with varying degrees of success. However, to date there is almost no formal theory explaining LDA's behavior, and despite its familiarity there is very little systematic analysis of, or guidance on, the properties of the data that affect the model's inferential performance. This paper seeks to address this gap by providing a systematic analysis of the factors that characterize LDA's performance. We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, together with a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages. Based on these results, we provide practical guidance on how to identify suitable data sets for topic models and how to specify particular model parameters.
Related resources
A review of text mining approaches and their function in discovering and extracting a topic
Background and aim: Four text mining methods are examined, with a focus on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature on text mining and topic modeling. Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...
Prioritizing Effective Factors in the Making Ethical Organizations by Using Combined Method of Interpretative Structural Modeling (ISM) and Principal Component Analysis (PCA)
Nowadays, organizations regard ethical principles in the business environment as an advantage and seek to strengthen them. This requires a coherent, interactive and cognitive understanding of the parts of the organization's internal and external environment, which leads to the realization of the rights of the organization's beneficiaries. The purpose of this paper is to prioritize the factors in...
Solutions of initial and boundary value problems via F-contraction mappings in metric-like space
We present sufficient conditions for the existence of solutions of second-order two-point boundary value and fractional order functional differential equation problems in a space where self distance is not necessarily zero. For this, first we introduce a Ciric type generalized F-contraction and F- Suzuki contraction in a metric-like space and give relevance to fixed point results. To illustrate...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, evaluating the extracted keywords' similarity with a golden standard and users' views of the model's keywords. Methodology: This is a mixed text-mining study in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Understanding the Role of Job Stress in Safety Climate in a Dairy Industry using Structural Equation Modeling
Background and purpose: The safety climate refers to employees’ perception of safety which can be affected by job-related stress in the workplace. This study aimed to assess the safety climate and investigate the relationship between job stress factors and safety climate dimensions in a dairy industry. Materials and Methods: This was a cross-sectional study. The data was collected using two se...